The 2019-2020 Coronavirus Pandemic Analysis

Contact: Smith Research

BACKGROUND & APPROACH

I wanted to track and trend the coronavirus outbreak out of my own curiosity. Some interesting questions may fall out of this, as it is a very historic moment, scientifically and analytically: a large amount of data is being shared across the globe and analyzed in real time, while the world has come to a halt because of it.
This analysis attempts to answer the following questions (more to come):

  1. What does the trend of the pandemic look like to date?
  2. What are future case predictions based on historical model?
  3. What interesting quirks or patterns emerge?

ASSUMPTIONS & LIMITATIONS: * This data is limited by the source. I realized early on that the number of cases conflicted depending on the source. Originally I was using JHU data, but it was always 'ahead' of Our World In Data. I also noticed that JHU's website was buggy: clicking on the U.S. stats did not actually reflect the U.S. So I changed data sources to be more consistent with what is presented in the media (and Our World In Data has more extensive plots I can compare my own to). An interesting aside might be: why the discrepancy? Was I missing something?
* Definitions are important, as is the idea that multiple variables accumulate in quantities like total cases (more testing, for example).

SOURCE RAW DATA: * https://ourworldindata.org/coronavirus
* https://github.com/CSSEGISandData/COVID-19/

INPUT DATA LOCATION: github (https://github.com/sbs87/coronavirus/tree/master/data)

OUTPUT DATA LOCATION: github (https://github.com/sbs87/coronavirus/tree/master/results)

TIMESTAMP

Start: ##—— Thu Jun 4 17:52:11 2020 ——##

PRE-ANALYSIS

The following sections are outside the scope of the 'analysis' itself but are still needed to prepare everything.

UPSTREAM PROCESSING/ANALYSIS

  1. Google Mobility Scraping, script available at get_google_mobility.py
# Mobility data has to be extracted from Google's PDF reports using a web-scraping script (Python, written by Peter Simone, https://github.com/petersim1/MIT_COVID19)

# See get_google_mobility.py for local script 

python3 get_google_mobility.py
# writes csv file of mobility data as "mobility.csv"
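A quick sanity check on the scraper's output might look like the sketch below. The real mobility.csv schema depends on get_google_mobility.py; the column names here are purely illustrative (a temp file stands in for the scraper output so the snippet is self-contained):

```r
# Sketch: read the scraper's csv back into R (illustrative columns only)
f <- tempfile(fileext = ".csv")
writeLines(c("state,date,retail_change",
             "PA,2020-04-01,-45",
             "PA,2020-04-08,-40"), f)
mobility <- read.csv(f, stringsAsFactors = FALSE)
str(mobility)  # 2 obs. of 3 variables
```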

SET UP ENVIRONMENT

Load libraries and set global variables

# timestamp start
timestamp()
## ##------ Thu Jun  4 17:52:11 2020 ------##

# clear previous environment
rm(list = ls())

##------------------------------------------
## LIBRARIES
##------------------------------------------
library(plyr)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.0     ✓ purrr   0.3.3
## ✓ tibble  3.0.0     ✓ dplyr   0.8.5
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::arrange()   masks plyr::arrange()
## x purrr::compact()   masks plyr::compact()
## x dplyr::count()     masks plyr::count()
## x dplyr::failwith()  masks plyr::failwith()
## x dplyr::filter()    masks stats::filter()
## x dplyr::id()        masks plyr::id()
## x dplyr::lag()       masks stats::lag()
## x dplyr::mutate()    masks plyr::mutate()
## x dplyr::rename()    masks plyr::rename()
## x dplyr::summarise() masks plyr::summarise()
## x dplyr::summarize() masks plyr::summarize()
library(ggplot2)
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(plot.utils)
library(utils)
library(knitr)

##------------------------------------------

##------------------------------------------
# GLOBAL VARIABLES
##------------------------------------------
user_name <- Sys.info()["user"]
working_dir <- paste0("/Users/", user_name, "/Projects/coronavirus/")  # don't forget trailing /
results_dir <- paste0(working_dir, "results/")  # assumes directory exists
results_dir_custom <- paste0(results_dir, "custom/")  # assumes directory exists


Corona_Cases.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"
Corona_Cases.US.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv"
Corona_Deaths.US.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_US.csv"
Corona_Deaths.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv"

Corona_Cases.fn <- paste0(working_dir, "data/", basename(Corona_Cases.source_url))
Corona_Cases.US.fn <- paste0(working_dir, "data/", basename(Corona_Cases.US.source_url))
Corona_Deaths.fn <- paste0(working_dir, "data/", basename(Corona_Deaths.source_url))
Corona_Deaths.US.fn <- paste0(working_dir, "data/", basename(Corona_Deaths.US.source_url))
default_theme <- theme_bw() + theme(text = element_text(size = 14))  # fix this
##------------------------------------------

FUNCTIONS

List of functions

function_name description
prediction_model outputs case estimates for given log-linear model parameters (slope and intercept)
make_long converts input data to long format (specialized cases)
name_overlaps outputs the column-name intersection and set differences of two data frames
find_linear_index finds the first date at which linearity occurs
##------------------------------------------
## FUNCTION: prediction_model
##------------------------------------------
## --- //// ----
# Takes days-vs-log10(cases) linear model parameters and a set of days since the 100th case, and outputs a data frame with the total number of predicted cases for those days
## --- //// ----
prediction_model <- function(m = 1, b = 0, days = 1) {
  # model is log10(cases) = m*days + b, so back-transform to get case counts
  total_cases <- 10^(m * days + b)
  prediction <- data.frame(days = days, Total_confirmed_cases_perstate = total_cases)
  return(prediction)
}
##------------------------------------------
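A quick standalone check of the log-linear idea (a sketch with hypothetical parameters, independent of the function above): if log10(cases) = m·days + b, predicted counts come from back-transforming with 10^.

```r
# Hypothetical fit: log10(cases) = 0.1*days + 2
m <- 0.1
b <- 2
days <- c(0, 10, 20)
predicted_cases <- 10^(m * days + b)
predicted_cases  # 100 1000 10000
```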

##------------------------------------------
## FUNCTION: make_long
##------------------------------------------
## --- //// ----
# Takes wide-format case data and converts it into long format, using date and total cases as variable/value. Also enforces standardization (assumes the data structure naming) by using a fixed variable name, value name, and id.vars
## --- //// ----
make_long <- function(data_in, variable.name = "Date",
                      value.name = "Total_confirmed_cases",
                      id.vars = c("case_type", "Province.State", "Country.Region",
                                  "Lat", "Long", "City", "Population")) {
  long_data <- melt(data_in,
                    id.vars = id.vars,
                    variable.name = variable.name,
                    value.name = value.name)
  return(long_data)
}
##------------------------------------------
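A minimal illustration of the melt step on a toy JHU-style wide frame (only Country.Region as id column here, for brevity):

```r
library(reshape2)

# Toy wide-format frame: one row per region, one column per date
wide <- data.frame(Country.Region = c("A", "B"),
                   X1.22.20 = c(0, 5),
                   X1.23.20 = c(1, 8))
long <- melt(wide, id.vars = "Country.Region",
             variable.name = "Date", value.name = "Total_confirmed_cases")
nrow(long)  # 4 (2 regions x 2 dates)
```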

## THIS WILL BE IN UTILS AT SOME POINT
name_overlaps <- function(df1, df2) {
  i <- intersect(names(df1), names(df2))
  sd1 <- setdiff(names(df1), names(df2))
  sd2 <- setdiff(names(df2), names(df1))
  cat("intersection:\n", paste(i, "\n"))
  cat("in df1 but not df2:\n", paste(sd1, "\n"))
  cat("in df2 but not df1:\n", paste(sd2, "\n"))
  return(list("int" = i, "sd_1_2" = sd1, "sd_2_1" = sd2))
}

##------------------------------------------
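The set operations underlying name_overlaps, shown directly on a toy pair of frames:

```r
df1 <- data.frame(a = 1, b = 2)
df2 <- data.frame(b = 3, c = 4)
intersect(names(df1), names(df2))  # "b"
setdiff(names(df1), names(df2))    # "a"
setdiff(names(df2), names(df1))    # "c"
```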

##------------------------------------------
## FUNCTION: find_linear_index
##------------------------------------------
## --- //// ----
# Find date at which total case data is linear (for a given data frame) 
## --- //// ----

find_linear_index<-function(tmp,running_avg=5){
  tmp$Total_confirmed_cases_perstate.log<-log(tmp$Total_confirmed_cases_perstate,2)
  derivative<-data.frame(matrix(nrow = nrow(tmp),ncol = 4))
  names(derivative)<-c("m.time","mm.time","cumsum","date")
  
  # First derivative
  for(t in 2:nrow(tmp)){
    slope.t<- tmp[t,"Total_confirmed_cases_perstate.log"]- tmp[t-1,"Total_confirmed_cases_perstate.log"]
    derivative[t,"m.time"]<-slope.t
    derivative[t,"date"]<-as.Date(tmp[t,"Date"])
  }
  
  # Second derivative
  for(t in 2:nrow(derivative)){
    slope.t<- derivative[t,"m.time"]- derivative[t-1,"m.time"]
    derivative[t,"mm.time"]<-slope.t
  }
  
  # Count how many of the last `running_avg` second-derivative values are within 0.2 of zero
  for(t in running_avg:nrow(derivative)){
    slope.t<- sum(abs(derivative[t:(t-(running_avg-1)),"mm.time"])<0.2,na.rm = T)
    derivative[t,"cumsum"]<-slope.t
  }
  
  # Find the date `running_avg` days before the stability point
  linear_begin<-min(derivative[!is.na(derivative$cumsum) & derivative$cumsum==running_avg,"date"])-running_avg
  
  return(linear_begin)
}
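The idea behind find_linear_index, sketched on a synthetic series (a sketch of the principle, not the function itself): exponential growth is a straight line in log space, so its second difference sits near zero, which is what the 0.2-window test above detects.

```r
# Synthetic exponential growth: cases double every ~3.3 days
days <- 1:20
cases <- 100 * 2^(0.3 * days)
log_cases <- log2(cases)
d1 <- diff(log_cases)  # first difference: constant 0.3
d2 <- diff(d1)         # second difference: ~0 throughout
all(abs(d2) < 0.2)     # TRUE: the whole series is in the "linear" regime
```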

READ IN DATA

# Q: do we want to archive previous versions? Maybe an auto git mv?

##------------------------------------------
## Download and read in latest data from github
##------------------------------------------
download.file(Corona_Cases.source_url, destfile = Corona_Cases.fn)
Corona_Totals.raw <- read.csv(Corona_Cases.fn, header = T, stringsAsFactors = F)

download.file(Corona_Cases.US.source_url, destfile = Corona_Cases.US.fn)
Corona_Totals.US.raw <- read.csv(Corona_Cases.US.fn, header = T, stringsAsFactors = F)

download.file(Corona_Deaths.source_url, destfile = Corona_Deaths.fn)
Corona_Deaths.raw <- read.csv(Corona_Deaths.fn, header = T, stringsAsFactors = F)

download.file(Corona_Deaths.US.source_url, destfile = Corona_Deaths.US.fn)
Corona_Deaths.US.raw <- read.csv(Corona_Deaths.US.fn, header = T, stringsAsFactors = F)

# latest date on all data:
paste("US deaths:", names(Corona_Deaths.US.raw)[ncol(Corona_Deaths.US.raw)])
## [1] "US deaths: X6.3.20"
paste("US total:", names(Corona_Totals.US.raw)[ncol(Corona_Totals.US.raw)])
## [1] "US total: X6.3.20"
paste("World deaths:", names(Corona_Deaths.raw)[ncol(Corona_Deaths.raw)])
## [1] "World deaths: X6.3.20"
paste("World total:", names(Corona_Totals.raw)[ncol(Corona_Totals.raw)])
## [1] "World total: X6.3.20"

PROCESS DATA

  • Convert to long format
  • Fix date formatting/convert to numeric date
  • Log10 transform total # cases
##------------------------------------------
## Combine death and total data frames
##------------------------------------------
Corona_Totals.raw$case_type<-"total"
Corona_Totals.US.raw$case_type<-"total"
Corona_Deaths.raw$case_type<-"death"
Corona_Deaths.US.raw$case_type<-"death"

# For some reason, Population is listed in the US death file but not in the other data. Weird. When combining, all datasets will have this column, but US deaths is the only useful one.
Corona_Totals.US.raw$Population<-"NA" 
Corona_Totals.raw$Population<-"NA"
Corona_Deaths.raw$Population<-"NA"

Corona_Cases.raw<-rbind(Corona_Totals.raw,Corona_Deaths.raw)
Corona_Cases.US.raw<-rbind(Corona_Totals.US.raw,Corona_Deaths.US.raw)
#TODO: custom utils- setdiff, intersect names... option to output in merging too
##------------------------------------------
# prepare raw datasets for eventual combining
##------------------------------------------
Corona_Cases.raw$City<-"NA" # US-level data has Cities
Corona_Cases.US.raw$Country_Region<-"US_state" # To differentiate from World-level stats

Corona_Cases.US.raw<-plyr::rename(Corona_Cases.US.raw,c("Province_State"="Province.State",
                                                  "Country_Region"="Country.Region",
                                                  "Long_"="Long",
                                                  "Admin2"="City"))


##------------------------------------------
## Convert to long format
##------------------------------------------
# JHU has a gross file format: wide, with one column per date in MM/DD/YY. So read this in as raw data but transform it to be better suited for analysis.
# Furthermore, the World- and US-level data are formatted differently, containing different columns, etc. Rectify this and combine the world-level stats with the U.S.-level stats.

Corona_Cases.long<-rbind(make_long(select(Corona_Cases.US.raw,-c(UID,iso2,iso3,code3,FIPS,Combined_Key))),
make_long(Corona_Cases.raw))


##------------------------------------------
## Fix date formatting, convert to numeric date
##------------------------------------------
Corona_Cases.long$Date<-gsub(Corona_Cases.long$Date,pattern = "^X",replacement = "0") # leading 0 read in as X
Corona_Cases.long$Date<-gsub(Corona_Cases.long$Date,pattern = "20$",replacement = "2020") # ends in .20 and not 2020
Corona_Cases.long$Date<-as.Date(Corona_Cases.long$Date,format = "%m.%d.%y")
Corona_Cases.long$Date.numeric<-as.numeric(Corona_Cases.long$Date)
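For reference, the two gsub steps traced on a single column name (this mirrors the code above on one value):

```r
raw <- "X1.22.20"                 # JHU column name for Jan 22, 2020
d <- gsub("^X", "0", raw)         # "01.22.20"
d <- gsub("20$", "2020", d)       # "01.22.2020"
as.Date(d, format = "%m.%d.%y")   # 2020-01-22 (%y reads the first two year digits; the rest is ignored)
```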

kable(table(select(Corona_Cases.long,c("Country.Region","case_type"))),caption = "Number of death and total case longitudinal datapoints per geographical region")
Number of death and total case longitudinal datapoints per geographical region
death total
Afghanistan 134 134
Albania 134 134
Algeria 134 134
Andorra 134 134
Angola 134 134
Antigua and Barbuda 134 134
Argentina 134 134
Armenia 134 134
Australia 1072 1072
Austria 134 134
Azerbaijan 134 134
Bahamas 134 134
Bahrain 134 134
Bangladesh 134 134
Barbados 134 134
Belarus 134 134
Belgium 134 134
Belize 134 134
Benin 134 134
Bhutan 134 134
Bolivia 134 134
Bosnia and Herzegovina 134 134
Botswana 134 134
Brazil 134 134
Brunei 134 134
Bulgaria 134 134
Burkina Faso 134 134
Burma 134 134
Burundi 134 134
Cabo Verde 134 134
Cambodia 134 134
Cameroon 134 134
Canada 1876 1876
Central African Republic 134 134
Chad 134 134
Chile 134 134
China 4422 4422
Colombia 134 134
Comoros 134 134
Congo (Brazzaville) 134 134
Congo (Kinshasa) 134 134
Costa Rica 134 134
Cote d’Ivoire 134 134
Croatia 134 134
Cuba 134 134
Cyprus 134 134
Czechia 134 134
Denmark 402 402
Diamond Princess 134 134
Djibouti 134 134
Dominica 134 134
Dominican Republic 134 134
Ecuador 134 134
Egypt 134 134
El Salvador 134 134
Equatorial Guinea 134 134
Eritrea 134 134
Estonia 134 134
Eswatini 134 134
Ethiopia 134 134
Fiji 134 134
Finland 134 134
France 1474 1474
Gabon 134 134
Gambia 134 134
Georgia 134 134
Germany 134 134
Ghana 134 134
Greece 134 134
Grenada 134 134
Guatemala 134 134
Guinea 134 134
Guinea-Bissau 134 134
Guyana 134 134
Haiti 134 134
Holy See 134 134
Honduras 134 134
Hungary 134 134
Iceland 134 134
India 134 134
Indonesia 134 134
Iran 134 134
Iraq 134 134
Ireland 134 134
Israel 134 134
Italy 134 134
Jamaica 134 134
Japan 134 134
Jordan 134 134
Kazakhstan 134 134
Kenya 134 134
Korea, South 134 134
Kosovo 134 134
Kuwait 134 134
Kyrgyzstan 134 134
Laos 134 134
Latvia 134 134
Lebanon 134 134
Lesotho 134 134
Liberia 134 134
Libya 134 134
Liechtenstein 134 134
Lithuania 134 134
Luxembourg 134 134
Madagascar 134 134
Malawi 134 134
Malaysia 134 134
Maldives 134 134
Mali 134 134
Malta 134 134
Mauritania 134 134
Mauritius 134 134
Mexico 134 134
Moldova 134 134
Monaco 134 134
Mongolia 134 134
Montenegro 134 134
Morocco 134 134
Mozambique 134 134
MS Zaandam 134 134
Namibia 134 134
Nepal 134 134
Netherlands 670 670
New Zealand 134 134
Nicaragua 134 134
Niger 134 134
Nigeria 134 134
North Macedonia 134 134
Norway 134 134
Oman 134 134
Pakistan 134 134
Panama 134 134
Papua New Guinea 134 134
Paraguay 134 134
Peru 134 134
Philippines 134 134
Poland 134 134
Portugal 134 134
Qatar 134 134
Romania 134 134
Russia 134 134
Rwanda 134 134
Saint Kitts and Nevis 134 134
Saint Lucia 134 134
Saint Vincent and the Grenadines 134 134
San Marino 134 134
Sao Tome and Principe 134 134
Saudi Arabia 134 134
Senegal 134 134
Serbia 134 134
Seychelles 134 134
Sierra Leone 134 134
Singapore 134 134
Slovakia 134 134
Slovenia 134 134
Somalia 134 134
South Africa 134 134
South Sudan 134 134
Spain 134 134
Sri Lanka 134 134
Sudan 134 134
Suriname 134 134
Sweden 134 134
Switzerland 134 134
Syria 134 134
Taiwan* 134 134
Tajikistan 134 134
Tanzania 134 134
Thailand 134 134
Timor-Leste 134 134
Togo 134 134
Trinidad and Tobago 134 134
Tunisia 134 134
Turkey 134 134
Uganda 134 134
Ukraine 134 134
United Arab Emirates 134 134
United Kingdom 1474 1474
Uruguay 134 134
US 134 134
US_state 436974 436974
Uzbekistan 134 134
Venezuela 134 134
Vietnam 134 134
West Bank and Gaza 134 134
Western Sahara 134 134
Yemen 134 134
Zambia 134 134
Zimbabwe 134 134
# Decouple population and lat/long data, refactor to make it more tidy
metadata_columns<-c("Lat","Long","Population")
metadata<-unique(select(filter(Corona_Cases.long,case_type=="death"),c("Country.Region","Province.State","City",all_of(metadata_columns))))
Corona_Cases.long<-select(Corona_Cases.long,-all_of(metadata_columns))

# Some countries are not summarized at the country level. Collapse all but US
Corona_Cases.long<-rbind.fill(ddply(filter(Corona_Cases.long,!Country.Region=="US_state"),c("case_type","Country.Region","Date","Date.numeric"),summarise,Total_confirmed_cases=sum(Total_confirmed_cases)),filter(Corona_Cases.long,Country.Region=="US_state"))

# Put total case and deaths side-by-side (wide)
Corona_Cases<-spread(Corona_Cases.long,key = case_type,value = Total_confirmed_cases)

# Compute mortality rate
Corona_Cases$mortality_rate<-Corona_Cases$death/Corona_Cases$total

#TMP
Corona_Cases<-plyr::rename(Corona_Cases,c("total"="Total_confirmed_cases","death"="Total_confirmed_deaths"))

##------------------------------------------
## log10 transform total # cases
##------------------------------------------
Corona_Cases$Total_confirmed_cases.log<-log(Corona_Cases$Total_confirmed_cases,10)
Corona_Cases$Total_confirmed_deaths.log<-log(Corona_Cases$Total_confirmed_deaths,10)
##------------------------------------------
       
##------------------------------------------
## Compute # of days since 100th for US data
##------------------------------------------

# Find the day the 100th case was recorded for each Country/Province. NOTE: Non-US countries may have odd provinces. For example, France is summarized at the country level but also has 3 provinces. I've only ensured the U.S. case100 works... so the case100_date for the U.S. is summarized both for the entire country (regardless of state) and on a per-state level.
# TODO: consider city-level summary as well. This data may be sparse

Corona_Cases<-merge(Corona_Cases,ddply(filter(Corona_Cases,Total_confirmed_cases>100),c("Country.Region"),summarise,case100_date=min(Date.numeric)))
Corona_Cases$Days_since_100<-Corona_Cases$Date.numeric-Corona_Cases$case100_date
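The threshold logic on a toy series, in base R (illustrative names; the pipeline above uses ddply for the same per-group minimum):

```r
toy <- data.frame(country = "X", day = 1:5,
                  cases = c(50, 90, 120, 200, 400))
over100 <- toy[toy$cases > 100, ]
first_day <- aggregate(day ~ country, over100, min)  # first day with > 100 cases: 3
toy$days_since_100 <- toy$day - first_day$day[match(toy$country, first_day$country)]
toy$days_since_100  # -2 -1 0 1 2
```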

##------------------------------------------
## Add population and lat/long data (CURRENTLY US ONLY)
##------------------------------------------

kable(filter(metadata,(is.na(Country.Region) | is.na(Population) )) %>% select(c("Country.Region","Province.State","City")) %>% unique(),caption = "Regions for which either population or Country is NA")
Regions for which either population or Country is NA
Country.Region Province.State City
# Drop missing data 
metadata<-filter(metadata,!(is.na(Country.Region) | is.na(Population) ))
# Convert remaining pop to numeric
metadata$Population<-as.numeric(metadata$Population)
## Warning: NAs introduced by coercion
# Add metadata to cases
Corona_Cases<-merge(Corona_Cases,metadata,all.x = T)
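merge with all.x = TRUE behaves like a left join: rows with no metadata match are kept, with NA filled in. On a toy pair:

```r
cases <- data.frame(region = c("A", "B", "C"), total = c(10, 20, 30))
meta  <- data.frame(region = c("A", "B"), pop = c(100, 400))
merged <- merge(cases, meta, all.x = TRUE)  # region C kept, pop becomes NA
merged$pop  # 100 400 NA
```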

##------------------------------------------
## Compute total and death cases relative to population 
##------------------------------------------

Corona_Cases$Total_confirmed_cases.per100<-100*Corona_Cases$Total_confirmed_cases/Corona_Cases$Population
Corona_Cases$Total_confirmed_deaths.per100<-100*Corona_Cases$Total_confirmed_deaths/Corona_Cases$Population


##------------------------------------------
## Filter df for US state-wide stats
##------------------------------------------

Corona_Cases.US_state<-filter(Corona_Cases,Country.Region=="US_state" & Total_confirmed_cases>0 )
Corona_Cases.US_state[!is.na(Corona_Cases.US_state$Province.State) & Corona_Cases.US_state$Province.State=="Pennsylvania" & Corona_Cases.US_state$City=="Delaware","City"]<-"Delaware_PA"

kable(table(select(Corona_Cases.US_state,c("Province.State"))),caption = "Number of longitudinal datapoints (total/death) per state")
Number of longitudinal datapoints (total/death) per state
Var1 Freq
Alabama 4805
Alaska 919
Arizona 1196
Arkansas 5055
California 4560
Colorado 4367
Connecticut 709
Delaware 295
Diamond Princess 79
District of Columbia 80
Florida 5110
Georgia 11459
Grand Princess 80
Guam 80
Hawaii 412
Idaho 2304
Illinois 6630
Indiana 6653
Iowa 6181
Kansas 5153
Kentucky 7426
Louisiana 4800
Maine 1220
Maryland 1865
Massachusetts 1197
Michigan 5830
Minnesota 5530
Mississippi 5953
Missouri 6655
Montana 2062
Nebraska 3902
Nevada 921
New Hampshire 840
New Jersey 1795
New Mexico 2036
New York 4555
North Carolina 7036
North Dakota 2430
Northern Mariana Islands 65
Ohio 6269
Oklahoma 4786
Oregon 2413
Pennsylvania 4936
Puerto Rico 80
Rhode Island 472
South Carolina 3467
South Dakota 3081
Tennessee 6786
Texas 14393
Utah 1097
Vermont 1125
Virgin Islands 80
Virginia 8892
Washington 3112
West Virginia 3278
Wisconsin 4778
Wyoming 1481
Corona_Cases.US_state<-merge(Corona_Cases.US_state,ddply(filter(Corona_Cases.US_state,Total_confirmed_cases>100),c("Province.State"),summarise,case100_date_state=min(Date.numeric)))
Corona_Cases.US_state$Days_since_100_state<-Corona_Cases.US_state$Date.numeric-Corona_Cases.US_state$case100_date_state

ANALYSIS

Q1: What is the trend in cases and mortality across geographical regions?

Plot # of cases vs time. For each geographical set:

  • comparative longitudinal case trend (absolute & log scale)
  • comparative longitudinal mortality trend
  • death vs total correlation

question dataset x y color facet pch dimensions
comparative_longitudinal_case_trend long time log_cases geography none (case type?) case_type [15, 50, 4] geography x (2 scale?) case type
comparative longitudinal case trend long time cases geography case_type ? [15, 50, 4] geography x (2+ scale) case type
comparative longitudinal mortality trend wide time mortality rate geography none none [15, 50, 4] geography
death vs total correlation wide cases deaths geography none none [15, 50, 4] geography
# total cases vs time
# death cases vs time
# mortality rate vs time
# death vs mortality


  # death vs mortality
  # total & death case vs time (same plot)

#<question> <x> <y> <colored> <facet> <dataset>
## trend in cases/deaths over time, compared across regions <time> <log cases> <geography*> <none> <.wide>
## trend in cases/deaths over time, compared across regions <time> <cases> <geography*> <case_type> <.long>
## trend in mortality rate over time, compared across regions <time> <mortality rate> <geography*> <none>
## how are death/mortality related/correlated? <time> <log cases> <geography*> <none>
## how are death and case load correlated? <cases> <deaths>

# lm for each?? -> apply lm for each region starting from the 100th case. m, b associated with each.
    # input: geographical region, log cases vs days since 100th case
    # output: m, b for each geographical region ID



# total/death on same plot - they differ by 2 logs, so when plotting log, use pch; when plotting absolute, need free scales
# when plotting death and cases on the same plot, melt

# CoronaCases -> filter sets (3)
  # world - choose countries with sufficient data

N<-ddply(filter(Corona_Cases,Total_confirmed_cases>100),c("Country.Region"),summarise,n=length(Country.Region))
ggplot(filter(N,n<100),aes(x=n))+
  geom_histogram()+
  default_theme+
  ggtitle("Distribution of number of days with at least 100 confirmed cases for each region")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

kable(arrange(N,-n),caption="Sorted number of days with at least 100 confirmed cases")
Sorted number of days with at least 100 confirmed cases
Country.Region n
US_state 47097
China 134
Diamond Princess 115
Korea, South 105
Japan 104
Italy 102
Iran 99
Singapore 96
France 95
Germany 95
Spain 94
US 93
Switzerland 91
United Kingdom 91
Belgium 90
Netherlands 90
Norway 90
Sweden 90
Austria 88
Malaysia 87
Australia 86
Bahrain 86
Denmark 86
Canada 85
Qatar 85
Iceland 84
Brazil 83
Czechia 83
Finland 83
Greece 83
Iraq 83
Israel 83
Portugal 83
Slovenia 83
Egypt 82
Estonia 82
India 82
Ireland 82
Kuwait 82
Philippines 82
Poland 82
Romania 82
Saudi Arabia 82
Indonesia 81
Lebanon 81
San Marino 81
Thailand 81
Chile 80
Pakistan 80
Luxembourg 79
Peru 79
Russia 79
Ecuador 78
Mexico 78
Slovakia 78
South Africa 78
United Arab Emirates 78
Armenia 77
Colombia 77
Croatia 77
Panama 77
Serbia 77
Taiwan* 77
Turkey 77
Argentina 76
Bulgaria 76
Latvia 76
Uruguay 76
Algeria 75
Costa Rica 75
Dominican Republic 75
Hungary 75
Andorra 74
Bosnia and Herzegovina 74
Jordan 74
Lithuania 74
Morocco 74
New Zealand 74
North Macedonia 74
Vietnam 74
Albania 73
Cyprus 73
Malta 73
Moldova 73
Brunei 72
Burkina Faso 72
Sri Lanka 72
Tunisia 72
Ukraine 71
Azerbaijan 70
Ghana 70
Kazakhstan 70
Oman 70
Senegal 70
Venezuela 70
Afghanistan 69
Cote d’Ivoire 69
Cuba 68
Mauritius 68
Uzbekistan 68
Cambodia 67
Cameroon 67
Honduras 67
Nigeria 67
West Bank and Gaza 67
Belarus 66
Georgia 66
Bolivia 65
Kosovo 65
Kyrgyzstan 65
Montenegro 65
Congo (Kinshasa) 64
Kenya 63
Niger 62
Guinea 61
Rwanda 61
Trinidad and Tobago 61
Paraguay 60
Bangladesh 59
Djibouti 57
El Salvador 56
Guatemala 55
Madagascar 54
Mali 53
Congo (Brazzaville) 50
Jamaica 50
Gabon 48
Somalia 48
Tanzania 48
Ethiopia 47
Burma 46
Sudan 45
Liberia 44
Maldives 42
Equatorial Guinea 41
Cabo Verde 39
Sierra Leone 37
Guinea-Bissau 36
Togo 36
Zambia 35
Eswatini 34
Chad 33
Tajikistan 32
Haiti 30
Sao Tome and Principe 30
Benin 28
Nepal 28
Uganda 28
Central African Republic 27
South Sudan 27
Guyana 25
Mozambique 24
Yemen 20
Mongolia 19
Mauritania 16
Nicaragua 16
Malawi 10
Syria 10
Zimbabwe 8
Bahamas 7
Libya 7
Comoros 5
# Pick the top countries with data (capped at max_colors for the color palette)
max_colors<-12
# find a way to fix this - China has different provinces. Plot doesn't look right...
sufficient_data<-arrange(filter(N,!Country.Region %in% c("US_state", "Diamond Princess")),-n)[1:max_colors,]
kable(sufficient_data,caption = paste0("Top ",max_colors," countries with sufficient data"))
Top 12 countries with sufficient data
Country.Region n
China 134
Korea, South 105
Japan 104
Italy 102
Iran 99
Singapore 96
France 95
Germany 95
Spain 94
US 93
Switzerland 91
United Kingdom 91
Corona_Cases.world<-filter(Corona_Cases,Country.Region %in% c(sufficient_data$Country.Region))


  #us 
  #    - by state
Corona_Cases.US<-filter(Corona_Cases,Country.Region=="US" & Total_confirmed_cases>0)
# summarize 
#!City %in% c("Unassigned") 
  #    - specific cities
#mortality_rate!=Inf & mortality_rate<=1
head(Corona_Cases)
##   Country.Region Province.State City       Date Date.numeric
## 1    Afghanistan           <NA> <NA> 2020-01-31        18292
## 2    Afghanistan           <NA> <NA> 2020-05-24        18406
## 3    Afghanistan           <NA> <NA> 2020-03-07        18328
## 4    Afghanistan           <NA> <NA> 2020-05-25        18407
## 5    Afghanistan           <NA> <NA> 2020-03-29        18350
## 6    Afghanistan           <NA> <NA> 2020-03-06        18327
##   Total_confirmed_deaths Total_confirmed_cases mortality_rate
## 1                      0                     0            NaN
## 2                    218                 10582     0.02060102
## 3                      0                     1     0.00000000
## 4                    219                 11173     0.01960082
## 5                      4                   120     0.03333333
## 6                      0                     1     0.00000000
##   Total_confirmed_cases.log Total_confirmed_deaths.log case100_date
## 1                      -Inf                       -Inf        18348
## 2                  4.024568                   2.338456        18348
## 3                  0.000000                       -Inf        18348
## 4                  4.048170                   2.340444        18348
## 5                  2.079181                   0.602060        18348
## 6                  0.000000                       -Inf        18348
##   Days_since_100 Lat Long Population Total_confirmed_cases.per100
## 1            -56  NA   NA         NA                           NA
## 2             58  NA   NA         NA                           NA
## 3            -20  NA   NA         NA                           NA
## 4             59  NA   NA         NA                           NA
## 5              2  NA   NA         NA                           NA
## 6            -21  NA   NA         NA                           NA
##   Total_confirmed_deaths.per100
## 1                            NA
## 2                            NA
## 3                            NA
## 4                            NA
## 5                            NA
## 6                            NA
Corona_Cases[!is.na(Corona_Cases$Province.State) & Corona_Cases$Province.State=="Pennsylvania" & Corona_Cases$City=="Delaware","City"]<-"Delaware_PA"


Corona_Cases.UScity<-filter(Corona_Cases,Province.State %in% c("Pennsylvania","Maryland","New York","New Jersey") & City %in% c("Bucks","Baltimore City", "New York","Burlington","Cape May","Delaware_PA"))


measure_vars_long<-c("Total_confirmed_cases.log","Total_confirmed_cases","Total_confirmed_deaths","Total_confirmed_deaths.log")
melt_arg_list<-list(variable.name = "case_type",value.name = "cases",measure.vars = c("Total_confirmed_cases","Total_confirmed_deaths"))
melt_arg_list$data=NULL


melt_arg_list$data=select(Corona_Cases.world,-ends_with(match = "log"))
Corona_Cases.world.long<-do.call(melt,melt_arg_list)
melt_arg_list$data=select(Corona_Cases.UScity,-ends_with(match = "log"))
Corona_Cases.UScity.long<-do.call(melt,melt_arg_list)
melt_arg_list$data=select(Corona_Cases.US_state,-ends_with(match = "log"))
Corona_Cases.US_state.long<-do.call(melt,melt_arg_list)

Corona_Cases.world.long$cases.log<-log(Corona_Cases.world.long$cases,10)
Corona_Cases.US_state.long$cases.log<-log(Corona_Cases.US_state.long$cases,10)
Corona_Cases.UScity.long$cases.log<-log(Corona_Cases.UScity.long$cases,10)


# what is the current death and total case load for US? For world? For states?
#-absolute
#-log

# what is mortality rate (US, world)
#-absolute

#how is death and case correlated? (US, world)
#-absolute
#Corona_Cases.US<-filter(Corona_Cases,Country.Region=="US" & Total_confirmed_cases>0)
#Corona_Cases.US.case100<-filter(Corona_Cases.US, Days_since_100>=0)
# linear model parameters
#(model_fit<-lm(formula = Total_confirmed_cases.log~Days_since_100,data= Corona_Cases.US.case100 ))

#(slope<-model_fit$coefficients[2])
#(intercept<-model_fit$coefficients[1])

# Correlation coefficient
#cor(x = Corona_Cases.US.case100$Days_since_100,y = Corona_Cases.US.case100$Total_confirmed_cases.log)
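The commented-out fit above, run on synthetic data to show what the extracted slope and intercept look like (the 0.08/day growth rate is hypothetical; on log10 scale it corresponds to roughly 20% daily growth):

```r
# Synthetic log-linear growth: log10(cases) = 0.08*day + 2, plus noise
set.seed(1)
day <- 0:60
log_cases <- 0.08 * day + 2 + rnorm(length(day), sd = 0.05)
fit <- lm(log_cases ~ day)
slope <- coef(fit)[["day"]]               # ~0.08
intercept <- coef(fit)[["(Intercept)"]]   # ~2
cor(day, log_cases)                       # close to 1 for clean exponential growth
```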

##------------------------------------------
## Plot World Data
##------------------------------------------
# Timestamp for world
timestamp_plot.world<-paste("Most recent date for which data available:",max(Corona_Cases.world$Date))#timestamp(quiet = T,prefix = "Updated ",suffix = " (EST)")


# Base template for plots
baseplot.world<-ggplot(data=NULL,aes(x=Days_since_100,col=Country.Region))+
  default_theme+
  scale_color_brewer(type = "qualitative",palette = "Paired")+
  ggtitle(paste("Log10 cases over time,",timestamp_plot.world))+
  theme(legend.position = "bottom",plot.title = element_text(size=12))


##/////////////////////////
### Plot Longitudinal cases

(Corona_Cases.world.long.plot<-baseplot.world+
    geom_point(data=Corona_Cases.world.long,aes(y=cases))+
    geom_line(data=Corona_Cases.world.long,aes(y=cases))+
    facet_wrap(~case_type,scales = "free_y",ncol=1)+
    ggtitle(timestamp_plot.world)
    )

(Corona_Cases.world.loglong.plot<-baseplot.world+
    geom_point(data=Corona_Cases.world.long,aes(y=cases.log))+
    geom_line(data=Corona_Cases.world.long,aes(y=cases.log))+
    facet_wrap(~case_type,scales = "free_y",ncol=1)+
    ggtitle(timestamp_plot.world))

##/////////////////////////
### Plot Longitudinal mortality rate

(Corona_Cases.world.mortality.plot<-baseplot.world+
    geom_point(data=Corona_Cases.world,aes(y=mortality_rate))+
    geom_line(data=Corona_Cases.world,aes(y=mortality_rate))+
    ylim(c(0,0.3))+
    ggtitle(timestamp_plot.world))
## Warning: Removed 100 rows containing missing values (geom_point).
## Warning: Removed 100 row(s) containing missing values (geom_path).

##/////////////////////////
### Plot death vs total case correlation

(Corona_Cases.world.casecor.plot<-ggplot(Corona_Cases.world,aes(x=Total_confirmed_cases,y=Total_confirmed_deaths,col=Country.Region))+
  geom_point()+
  geom_line()+
  default_theme+
  scale_color_brewer(type = "qualitative",palette = "Paired")+
  ggtitle(paste("Log10 cases over time,",timestamp_plot.world))+
  theme(legend.position = "bottom",plot.title = element_text(size=12))+
    ggtitle(timestamp_plot.world))

### Write plots

write_plot(Corona_Cases.world.long.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.long.plot.png"
write_plot(Corona_Cases.world.loglong.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.loglong.plot.png"
write_plot(Corona_Cases.world.mortality.plot,wd = results_dir)
## Warning: Removed 100 rows containing missing values (geom_point).

## Warning: Removed 100 row(s) containing missing values (geom_path).
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.mortality.plot.png"
write_plot(Corona_Cases.world.casecor.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.casecor.plot.png"
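`write_plot` is defined upstream and not shown in this section. From its use here (it saves each plot under the plot object's own name in the given directory and echoes the path it wrote), a minimal sketch, assuming `ggsave` defaults, might be:

```r
library(ggplot2)

# Sketch only: the real write_plot() is defined upstream. It appears to save
# the plot as "<object name>.png" inside wd and return the full path.
write_plot_sketch <- function(plot, wd) {
  path <- file.path(wd, paste0(deparse(substitute(plot)), ".png"))
  ggsave(filename = path, plot = plot, width = 8, height = 6)
  path
}

# Hypothetical usage:
demo.plot <- ggplot(data.frame(x = 1:3, y = c(2, 4, 8)), aes(x, y)) + geom_point()
write_plot_sketch(demo.plot, wd = tempdir())
```

The `deparse(substitute(plot))` trick is what would make the saved filename match the variable name, consistent with the paths echoed above.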
##------------------------------------------
## Plot US State Data
##-----------------------------------------

baseplot.US<-ggplot(data=NULL,aes(x=Days_since_100_state,col=case_type))+
  default_theme+
  facet_wrap(~Province.State)+
  ggtitle(paste("Log10 cases over time,",timestamp_plot.world))

Corona_Cases.US_state.long.plot<-baseplot.US+geom_point(data=Corona_Cases.US_state.long,aes(y=cases.log))
##------------------------------------------
## Plot US City Data
##-----------------------------------------

Corona_Cases.US.plotdata<-filter(Corona_Cases.US_state,Province.State %in% c("Pennsylvania","Maryland","New York","New Jersey") &
                                   City %in% c("Bucks","Baltimore City", "New York","Burlington","Cape May","Delaware_PA") &
                                   Total_confirmed_cases>0) 
timestamp_plot<-paste("Most recent date for which data available:",max(Corona_Cases.US.plotdata$Date))#timestamp(quiet = T,prefix = "Updated ",suffix = " (EST)")

city_colors<-c("Bucks"='#beaed4',"Baltimore City"='#386cb0', "New York"='#7fc97f',"Burlington"='#fdc086',"Cape May"="#e78ac3","Delaware_PA"="#b15928")

##/////////////////////////
### Plot death vs total case correlation

(Corona_Cases.city.loglong.plot<-ggplot(melt(Corona_Cases.US.plotdata,measure.vars = c("Total_confirmed_cases.log","Total_confirmed_deaths.log"),variable.name = "case_type",value.name = "cases"),aes(x=Date,y=cases,col=City,pch=case_type))+
  geom_point(size=4)+
    geom_line()+
  default_theme+
  #facet_wrap(~case_type)+
    ggtitle(paste("Log10 total and death cases over time,",timestamp_plot))+
theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
    scale_color_manual(values = city_colors)+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

(Corona_Cases.city.long.plot<-ggplot(filter(Corona_Cases.US.plotdata,Province.State !="New York"),aes(x=Date,y=Total_confirmed_cases,col=City))+
  geom_point(size=4)+
  geom_line()+
  default_theme+
  facet_grid(~Province.State,scales = "free_y")+
  ggtitle(paste("MD, PA, NJ total cases over time,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
  scale_color_manual(values = city_colors)+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

(Corona_Cases.city.mortality.plot<-ggplot(Corona_Cases.US.plotdata,aes(x=Date,y=mortality_rate,col=City))+
  geom_point(size=3)+
  geom_line(size=2)+
  default_theme+
  ggtitle(paste("Mortality rate (deaths/total) over time,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
  scale_color_manual(values = city_colors)+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

(Corona_Cases.city.casecor.plot<-ggplot(filter(Corona_Cases.US.plotdata,Province.State !="New York"),aes(y=Total_confirmed_deaths,x=Total_confirmed_cases,col=City))+
  geom_point(size=3)+
  geom_line(size=2)+
  default_theme+
  ggtitle(paste("Correlation of death vs total cases,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12))+
  scale_color_manual(values = city_colors))

(Corona_Cases.city.long.normalized.plot<-ggplot(filter(Corona_Cases.US.plotdata,Province.State !="New York"),aes(x=Date,y=Total_confirmed_cases.per100,col=City))+
  geom_point(size=4)+
  geom_line()+
  default_theme+
  facet_grid(~Province.State)+
  ggtitle(paste("MD, PA, NJ total cases over time per 100 people,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
  scale_color_manual(values = city_colors)  +
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

write_plot(Corona_Cases.city.long.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.long.plot.png"
write_plot(Corona_Cases.city.loglong.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.loglong.plot.png"
write_plot(Corona_Cases.city.mortality.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.mortality.plot.png"
write_plot(Corona_Cases.city.casecor.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.casecor.plot.png"
write_plot(Corona_Cases.city.long.normalized.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.long.normalized.plot.png"

Q1b: What is the model?

Fit the cases to a linear model:

1. Find the time at which cases vs date becomes linear in each plot
2. Fit a linear model for each state

# What is the predict # of cases for the next few days?
# How is the model performing historically?

Corona_Cases.US_state.summary<-ddply(Corona_Cases.US_state,
                                     c("Province.State","Date"),
                                     summarise,
                                     Total_confirmed_cases_perstate=sum(Total_confirmed_cases)) %>% 
    filter(Total_confirmed_cases_perstate>100)

# Compute the states with the most cases (for coloring and for linear model)
top_states_totals<-head(ddply(Corona_Cases.US_state.summary,c("Province.State"),summarise, Total_confirmed_cases_perstate.max=max(Total_confirmed_cases_perstate)) %>% arrange(-Total_confirmed_cases_perstate.max),n=max_colors)

kable(top_states_totals,caption = "Top 12 States, total count ")
top_states<-top_states_totals$Province.State

# Manually swap New York out of the state list in favor of Maryland
top_states_modified<-c(top_states[top_states !="New York"],"Maryland")

# Plot with all states:
(Corona_Cases.US_state.summary.plot<-ggplot(Corona_Cases.US_state.summary,aes(x=Date,y=Total_confirmed_cases_perstate))+
  geom_point()+
  geom_point(data=filter(Corona_Cases.US_state.summary,Province.State %in% top_states),aes(col=Province.State))+
  scale_color_brewer(type = "qualitative",palette = "Paired")+
  default_theme+
  theme(axis.text.x = element_text(angle=45,hjust=1),legend.position = "bottom")+
  ggtitle("Total confirmed cases per state, top 12 colored")+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

##------------------------------------------
## Fit linear model to time vs total cases
##-----------------------------------------

# First, find the date at which each state's cases vs time becomes linear (2nd derivative is about 0)
li<-ddply(Corona_Cases.US_state.summary,c("Province.State"),find_linear_index)
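`find_linear_index` is defined upstream; per the comment above, it locates where the discrete second difference of the case curve is near zero. A self-contained sketch of that idea on toy data (the tolerance `tol` and the exact return convention are assumptions, not the actual definition):

```r
# Sketch of a change-point finder: return the numeric Date at which the
# series' discrete second difference first falls near zero, i.e. where
# growth has become roughly linear.
find_linear_index_sketch <- function(df, tol = 5) {
  d2 <- diff(df$Total_confirmed_cases_perstate, differences = 2)
  idx <- which(abs(d2) <= tol)[1]   # first near-zero second difference
  as.numeric(df$Date[idx + 2])      # diff() shortens the series by 2
}

# Toy series: quadratic ramp that settles into a straight line
toy <- data.frame(Date = as.Date("2020-04-01") + 0:9,
                  Total_confirmed_cases_perstate =
                    c(100, 140, 200, 280, 380, 480, 580, 680, 780, 880))
find_linear_index_sketch(toy)
```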

# Compute linear model for each state starting at the point at which data becomes linear
for(i in 1:nrow(li)){
  Province.State.i<-li[i,"Province.State"]
  date.i<-li[i,"V1"]
  data.i<-filter(Corona_Cases.US_state.summary,Province.State==Province.State.i & as.numeric(Date) >= date.i)
  model_results<-lm(data.i,formula = Total_confirmed_cases_perstate~Date)
  slope<-model_results$coefficients[2]
  intercept<-model_results$coefficients[1]
  li[li$Province.State==Province.State.i,"m"]<-slope
  li[li$Province.State==Province.State.i,"b"]<-intercept
  }
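Note that `lm()` treats a `Date` predictor as its underlying numeric value (days since 1970-01-01), so the intercept `b` above is the extrapolated case count at the epoch, a large negative number, and predictions must plug in numeric dates. A self-contained illustration:

```r
# Perfectly linear toy series: 250 new cases per day
d <- data.frame(Date = as.Date("2020-04-01") + 0:6,
                cases = 1000 + 250 * (0:6))
fit <- lm(cases ~ Date, data = d)
m <- unname(coef(fit)[2])  # slope: 250 cases/day
b <- unname(coef(fit)[1])  # intercept at the 1970-01-01 epoch (large negative)

# Extrapolate one day past the data by plugging in the numeric date
m * as.numeric(as.Date("2020-04-08")) + b
```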

# Compute top state case load with fitted model

(Corona_Cases.US_state.lm.plot<-ggplot(filter(Corona_Cases.US_state.summary,Province.State %in% top_states_modified ))+
    geom_abline(data=filter(li,Province.State %in% top_states_modified),
                aes(slope = m,intercept = b,col=Province.State),lty=2)+
    geom_point(aes(x=Date,y=Total_confirmed_cases_perstate,col=Province.State))+
    scale_color_brewer(type = "qualitative",palette = "Paired")+
    default_theme+
    theme(axis.text.x = element_text(angle=45,hjust=1),legend.position = "bottom")+
    ggtitle("Total confirmed cases per state, top 12 colored")+
    scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

##------------------------------------------
## Predict the number of total cases over the next week
##-----------------------------------------

predicted_days<-c(0,1,2,3,7)+as.numeric(as.Date("2020-04-20"))

# Start with a zero-row frame so no all-NA seed row ends up in the results
predicted_days_df<-data.frame(Province.State=character(),days=numeric(),Total_confirmed_cases_perstate=numeric())

# Use model parameters to estimate case loads
for(state.i in top_states_modified){
  predicted_days_df<-rbind(predicted_days_df,
                           data.frame(Province.State=state.i,
                                      prediction_model(m = li[li$Province.State==state.i,"m"],
                                                       b =li[li$Province.State==state.i,"b"] ,
                                                       days =predicted_days )))
  }
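`prediction_model` is defined upstream; consistent with how this loop consumes it, it plausibly evaluates the fitted line `m * day + b` and returns the `days` and `Total_confirmed_cases_perstate` columns the `rbind()` expects. A sketch under that assumption (the name `prediction_model_sketch` and the example parameters are hypothetical):

```r
prediction_model_sketch <- function(m, b, days) {
  data.frame(days = days,
             Total_confirmed_cases_perstate = m * days + b)
}

# Hypothetical parameters: 500 new cases/day from a fitted state model
prediction_model_sketch(m = 500, b = -9.1e6,
                        days = as.numeric(as.Date("2020-04-20")) + c(0, 1, 2, 3, 7))
```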

predicted_days_df$Date<-as.Date(predicted_days_df$days,origin="1970-01-01")

kable(predicted_days_df,caption = "Predicted total cases over the next week for selected states")

##------------------------------------------
## Write plots
##-----------------------------------------

write_plot(Corona_Cases.US_state.summary.plot,wd = results_dir)
write_plot(Corona_Cases.US_state.lm.plot,wd = results_dir)

##------------------------------------------
## Write tables
##-----------------------------------------

write.csv(predicted_days_df,file = paste0(results_dir,"predicted_total_cases_days.csv"),quote = F,row.names = F)

Q2: What is the predicted number of cases?

What is the predicted spread of COVID-19 based on the model thus far? Additional questions:

Why did it take until day 40 for a log-linear trend to start? How long until x number of cases? When will the plateau happen? Are any effects of social distancing noticeable, and with what delay?

##------------------------------------------
## Prediction and Prediction Accuracy
##------------------------------------------


today_num<-max(Corona_Cases.US$Days_since_100)
predicted_days<-today_num+c(1,2,3,7)

#mods = dlply(mydf, .(x3), lm, formula = y ~ x1 + x2)
#today:
Corona_Cases.US[Corona_Cases.US$Days_since_100==(today_num-1),]
Corona_Cases.US[Corona_Cases.US$Days_since_100==today_num,]
Corona_Cases.US$type<-"Historical"


#prediction_values<-prediction_model(m=slope,b=intercept,days = predicted_days)$Total_confirmed_cases

historical_model<-data.frame(date=today_num,m=slope,b=intercept)

# Toy example: two "states" with different linear trends, motivating a
# per-group fit rather than a single pooled lm
tmp<-data.frame(state=rep(c("A","B"),each=3),x=c(1,2,3,4,5,6))
tmp$y<-c(tmp[1:3,"x"]+5,tmp[4:6,"x"]*5+1)
lm(data =tmp,formula = y~x ) # pooled fit: one slope shared by both states

# Fit one lm per group, then collect each group's intercept (b) and slope (m)
train_lm<-function(input_data,group_column,formula_input){
  case_models <- dlply(input_data, group_column, lm, formula = formula_input)
  case_models.parameters <- ldply(case_models, coef)
  # rename "(Intercept)" to "b" and the single predictor column to "m"
  predictor <- all.vars(formula_input)[2]
  plyr::rename(case_models.parameters, setNames(c("b","m"), c("(Intercept)", predictor)))
}

train_lm(tmp,"state",y~x)
 
# model for previous y days: refit using only data before day_x, predict the
# next three days, and compare against what actually happened
historical_model_predictions<-data.frame(day_x=NULL,Days_since_100=NULL,Total_confirmed_cases=NULL,Total_confirmed_cases.log=NULL)
for(i in c(1,2,3,4,5,6,7,8,9,10)){
day_x<-today_num-i # 1, 2, 3, 4
day_x_nextweek<-day_x+c(1,2,3)
model_fit_x<-lm(data = filter(Corona_Cases.US.case100,Days_since_100 < day_x),formula = Total_confirmed_cases.log~Days_since_100)
prediction_day_x_nextweek<-prediction_model(m = model_fit_x$coefficients[2],b = model_fit_x$coefficients[1],days = day_x_nextweek)
prediction_day_x_nextweek$type<-"Predicted"
actual_day_x_nextweek<-filter(Corona_Cases.US,Days_since_100 %in% day_x_nextweek) %>% select(c(Days_since_100,Total_confirmed_cases,Total_confirmed_cases.log))
actual_day_x_nextweek$type<-"Historical"
historical_model_predictions.i<-data.frame(day_x=day_x,rbind(actual_day_x_nextweek,prediction_day_x_nextweek))
historical_model_predictions<-rbind(historical_model_predictions.i,historical_model_predictions)
}

historical_model_predictions.withHx<-rbind.fill(historical_model_predictions,data.frame(Corona_Cases.US,type="Historical"))
historical_model_predictions.withHx$Total_confirmed_cases.log2<-log(historical_model_predictions.withHx$Total_confirmed_cases,2)

(historical_model_predictions.plot<-ggplot(historical_model_predictions.withHx,aes(x=Days_since_100,y=Total_confirmed_cases.log,col=type))+
    geom_point(size=3)+
    default_theme+
    theme(legend.position = "bottom")+ 
      #geom_abline(slope = slope,intercept =intercept,lty=2)+
    #facet_wrap(~case_type,ncol=1)+
    scale_color_manual(values = c("Historical"="#377eb8","Predicted"="#e41a1c")))
write_plot(historical_model_predictions.plot,wd=results_dir)

Q3: What is the effect of social distancing and decreased mobility on case load?

Load data from Google, which computes % change in user mobility relative to baseline for:

* Recreation
* Workplace
* Residence
* Parks
* Grocery
* Transit

Data from https://www.google.com/covid19/mobility/

# See pre-processing section for script on gathering mobility data

# UNDER DEVELOPMENT

mobility<-read.csv("/Users/stevensmith/Projects/MIT_COVID19/mobility.csv",header = T,stringsAsFactors = F)
#mobility$Retail_Recreation<-as.numeric(sub(mobility$Retail_Recreation,pattern = "%",replacement = ""))
#mobility$Workplace<-as.numeric(sub(mobility$Workplace,pattern = "%",replacement = ""))
#mobility$Residential<-as.numeric(sub(mobility$Residential,pattern = "%",replacement = ""))

##------------------------------------------
## Show relationship between mobility and caseload
##------------------------------------------
mobility$County<-gsub(mobility$County,pattern = " County",replacement = "")
Corona_Cases.US_state.mobility<-merge(Corona_Cases.US_state,plyr::rename(mobility,c("State"="Province.State","County"="City")))
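`merge()` joins on all column names the two frames share, so the rename above is what lines the mobility keys up with `Province.State` and `City`. A toy illustration of the rename-then-merge pattern, with hypothetical values:

```r
library(plyr)

mob <- data.frame(State = "Maryland", County = "Baltimore City",
                  Workplace = "-35%", stringsAsFactors = FALSE)
cases <- data.frame(Province.State = "Maryland", City = "Baltimore City",
                    Total_confirmed_cases = 1200, stringsAsFactors = FALSE)

# Rename the mobility keys to match the case table, then join on the
# shared columns (Province.State, City)
merged <- merge(cases, plyr::rename(mob, c("State" = "Province.State", "County" = "City")))
# Strip the "%" suffix so the mobility change is usable as a number
merged$Workplace <- as.numeric(gsub(merged$Workplace, pattern = "%", replacement = ""))
merged
```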

#Corona_Cases.US_state.tmp<-merge(metadata,Corona_Cases.US_state.tmp)
# Needs to happen upstream, see todos
#Corona_Cases.US_state.tmp$Total_confirmed_cases.perperson<-Corona_Cases.US_state.tmp$Total_confirmed_cases/as.numeric(Corona_Cases.US_state.tmp$Population)
mobility_measures<-c("Retail_Recreation","Grocery_Pharmacy","Parks","Transit","Workplace","Residential")

plot_data<-filter(Corona_Cases.US_state.mobility, Date.numeric==max(Corona_Cases.US_state$Date.numeric) ) %>% melt(measure.vars=mobility_measures) 
plot_data$value<-as.numeric(gsub(plot_data$value,pattern = "%",replacement = ""))
plot_data<-filter(plot_data,!is.na(value))

(mobility.plot<-ggplot(filter(plot_data,Province.State %in% c("Pennsylvania","Maryland","New Jersey","California","Delaware","Connecticut")),aes(y=Total_confirmed_cases.per100,x=value))+geom_point()+
  facet_grid(Province.State~variable,scales = "free")+
  xlab("Mobility change from baseline (%)")+
  ylab(paste0("Confirmed cases per 100 people(Today)"))+
  default_theme+
  ggtitle("Mobility change vs cases"))

(mobility.global.plot<-ggplot(plot_data,aes(y=Total_confirmed_cases.per100,x=value))+geom_point()+
  facet_wrap(~variable,scales = "free")+
  xlab("Mobility change from baseline (%)")+
  ylab(paste0("Confirmed cases (Today) per 100 people"))+
  default_theme+
  ggtitle("Mobility change vs cases"))

plot_data.permobility_summary<-ddply(plot_data,c("Province.State","variable"),summarise,cor=cor(y =Total_confirmed_cases.per100,x=value),median_change=median(x=value)) %>% arrange(-abs(cor))

kable(plot_data.permobility_summary,caption = "Ranked per-state mobility correlation with total confirmed cases")
Ranked per-state mobility correlation with total confirmed cases
Province.State variable cor median_change
Alaska Transit 1.0000000 -63.0
Delaware Retail_Recreation 1.0000000 -39.5
Delaware Grocery_Pharmacy 1.0000000 -17.5
Delaware Parks -1.0000000 20.5
Delaware Transit 1.0000000 -37.0
Delaware Workplace 1.0000000 -37.0
Delaware Residential -1.0000000 14.0
Hawaii Retail_Recreation 0.9978008 -56.0
Hawaii Grocery_Pharmacy 0.9806468 -34.0
New Hampshire Parks 0.9562166 -20.0
Maine Transit -0.9146666 -50.0
Connecticut Grocery_Pharmacy -0.8908935 -6.0
Utah Residential -0.8903955 12.0
Vermont Parks 0.8578809 -35.5
South Dakota Parks 0.8447120 -26.0
Utah Transit -0.8105495 -18.0
Wyoming Parks -0.7897688 -4.0
Massachusetts Workplace -0.7567665 -39.0
Rhode Island Workplace -0.7539745 -39.5
Hawaii Residential -0.7533303 19.0
Alaska Residential 0.7516153 13.0
Connecticut Transit -0.7501968 -50.0
Alaska Grocery_Pharmacy -0.7454276 -7.0
Alaska Workplace -0.7426317 -33.0
Hawaii Parks 0.7173402 -72.0
Wyoming Transit -0.7099125 -17.0
Arizona Grocery_Pharmacy -0.7022082 -15.0
Hawaii Transit 0.6576423 -89.0
Vermont Grocery_Pharmacy -0.6541649 -25.0
Utah Parks -0.6539476 17.0
New York Workplace -0.6493932 -34.5
Maine Workplace -0.6362635 -30.0
Rhode Island Retail_Recreation -0.6296900 -45.0
Rhode Island Residential -0.6121146 18.5
Utah Workplace -0.6097315 -37.0
New Jersey Parks -0.6022759 -6.0
New York Retail_Recreation -0.5877693 -46.0
Nebraska Workplace 0.5845124 -32.0
North Dakota Parks 0.5709901 -34.0
New Jersey Workplace -0.5606563 -44.0
Arizona Retail_Recreation -0.5440058 -42.5
Arizona Residential 0.5299461 13.0
New York Parks 0.5288254 20.0
Nevada Transit -0.5187183 -20.0
Connecticut Residential 0.5180246 14.0
Connecticut Retail_Recreation -0.5127476 -45.0
Massachusetts Retail_Recreation -0.5057999 -44.0
Maine Parks 0.5054700 -31.0
North Dakota Retail_Recreation -0.4970692 -42.0
Hawaii Workplace 0.4965663 -46.0
New Jersey Grocery_Pharmacy -0.4824642 2.5
Montana Parks -0.4824210 -58.0
New Mexico Grocery_Pharmacy -0.4796212 -11.0
Connecticut Workplace -0.4790782 -39.0
Kansas Parks 0.4745543 72.0
West Virginia Parks 0.4744931 -33.0
Nebraska Residential -0.4741852 14.0
Montana Residential 0.4685175 14.0
Rhode Island Parks 0.4677024 52.0
Iowa Parks -0.4583242 28.5
New Jersey Retail_Recreation -0.4582406 -62.5
New Mexico Residential 0.4485387 13.5
Wyoming Workplace -0.4452572 -31.0
Idaho Residential -0.4385215 11.0
Illinois Transit -0.4380161 -31.0
Arizona Transit 0.4324011 -38.0
New Mexico Parks 0.4321583 -31.5
Pennsylvania Workplace -0.4311103 -36.0
South Carolina Workplace 0.4224839 -30.0
Maryland Workplace -0.4214547 -35.0
Vermont Residential 0.4213470 11.5
New Jersey Transit -0.4134691 -50.5
Massachusetts Grocery_Pharmacy -0.4131395 -7.0
New Hampshire Residential -0.4048538 14.0
Alabama Grocery_Pharmacy -0.4039052 -2.0
Kentucky Parks -0.4033151 28.5
Maryland Grocery_Pharmacy -0.3850651 -10.0
Pennsylvania Retail_Recreation -0.3782300 -45.0
Idaho Workplace -0.3774223 -29.0
California Transit -0.3766398 -42.0
Idaho Grocery_Pharmacy -0.3744024 -5.5
Alabama Workplace -0.3736764 -29.0
New York Transit -0.3714117 -48.0
Wisconsin Transit -0.3706441 -23.5
California Residential 0.3551857 14.0
Idaho Transit -0.3534268 -30.0
Michigan Parks 0.3531518 30.0
New Mexico Retail_Recreation -0.3528540 -42.5
Nebraska Grocery_Pharmacy 0.3402839 -0.5
Wyoming Grocery_Pharmacy -0.3356406 -10.0
North Carolina Grocery_Pharmacy 0.3239188 0.0
Alaska Retail_Recreation 0.3237866 -39.0
Maine Retail_Recreation -0.3115446 -42.0
Virginia Transit -0.3097871 -33.0
California Parks -0.3093848 -38.5
Minnesota Transit -0.3060800 -28.5
Vermont Retail_Recreation 0.3027924 -57.0
Maryland Retail_Recreation -0.3011925 -39.0
North Dakota Workplace 0.2957788 -40.0
Pennsylvania Parks 0.2923230 12.0
Montana Workplace -0.2916446 -40.0
Alabama Transit -0.2914585 -36.5
Florida Residential 0.2901902 14.0
Arkansas Retail_Recreation -0.2873305 -30.0
Colorado Residential 0.2867051 14.0
Oregon Grocery_Pharmacy 0.2852637 -7.0
Illinois Workplace -0.2827444 -30.5
North Carolina Workplace 0.2817684 -31.0
Mississippi Residential 0.2801978 13.0
Nevada Residential 0.2737241 17.0
Texas Residential -0.2736479 15.0
Rhode Island Transit -0.2692428 -56.0
Texas Workplace 0.2664941 -32.0
Georgia Grocery_Pharmacy -0.2658412 -10.0
Rhode Island Grocery_Pharmacy 0.2569036 -7.5
Maryland Residential 0.2558353 15.0
Vermont Workplace -0.2551549 -43.0
Kansas Workplace 0.2538084 -32.0
Missouri Residential -0.2516827 13.0
Florida Parks -0.2510542 -43.0
West Virginia Grocery_Pharmacy -0.2505455 -6.0
Tennessee Workplace -0.2496062 -31.0
Arkansas Residential 0.2493457 12.0
Illinois Parks 0.2457822 26.5
Nevada Retail_Recreation -0.2450442 -43.0
Idaho Retail_Recreation -0.2419329 -39.5
Pennsylvania Grocery_Pharmacy -0.2413544 -6.0
South Carolina Parks -0.2364358 -23.0
Tennessee Residential 0.2339231 11.5
New York Grocery_Pharmacy -0.2309366 8.0
Georgia Workplace -0.2186797 -33.5
Utah Retail_Recreation -0.2184129 -40.0
Montana Transit 0.2118475 -41.0
Wisconsin Parks 0.2099115 51.5
Illinois Residential 0.2029153 14.0
Michigan Workplace -0.2024396 -40.0
Georgia Retail_Recreation -0.2000185 -41.0
North Carolina Transit 0.1991395 -32.0
Mississippi Grocery_Pharmacy -0.1977919 -8.0
Kansas Grocery_Pharmacy -0.1932347 -14.0
South Dakota Workplace 0.1925334 -35.0
North Dakota Grocery_Pharmacy -0.1924872 -8.0
West Virginia Workplace 0.1920828 -33.0
Texas Parks 0.1910788 -42.0
Colorado Parks -0.1903357 2.0
Kentucky Transit -0.1880457 -31.0
Washington Grocery_Pharmacy 0.1868617 -7.0
Virginia Residential 0.1828733 14.0
Minnesota Parks 0.1811445 -9.0
North Carolina Residential 0.1810609 13.0
Wisconsin Residential -0.1792091 14.0
New Jersey Residential 0.1776270 18.0
California Grocery_Pharmacy -0.1774141 -11.5
Connecticut Parks 0.1755438 43.0
Washington Workplace -0.1717082 -38.0
California Retail_Recreation -0.1693774 -44.0
Kentucky Grocery_Pharmacy 0.1689849 4.0
Iowa Transit 0.1671307 -24.0
New Mexico Transit 0.1645164 -38.5
New Hampshire Retail_Recreation -0.1605983 -41.0
Massachusetts Parks 0.1604041 39.0
Pennsylvania Transit -0.1595062 -42.0
Missouri Workplace 0.1590797 -28.0
Ohio Transit 0.1578772 -28.0
Indiana Residential 0.1538474 12.0
Georgia Residential -0.1537190 13.0
Indiana Retail_Recreation 0.1506399 -38.0
Florida Retail_Recreation 0.1479572 -43.0
South Dakota Transit -0.1471189 -40.0
Virginia Grocery_Pharmacy -0.1456281 -8.0
Oregon Transit 0.1444970 -27.5
Alabama Parks 0.1387564 -1.0
Nebraska Transit -0.1384569 -9.0
Mississippi Retail_Recreation -0.1384549 -40.0
Michigan Retail_Recreation -0.1372049 -53.0
South Dakota Residential 0.1361254 15.0
North Dakota Transit 0.1319685 -48.0
Mississippi Transit -0.1318720 -38.5
Oregon Workplace -0.1315716 -31.0
Arkansas Parks 0.1298411 -12.0
Washington Residential 0.1297233 13.0
Virginia Parks 0.1294408 6.0
California Workplace -0.1285043 -36.0
Wisconsin Workplace -0.1249426 -31.0
Alabama Retail_Recreation 0.1232321 -39.0
Minnesota Retail_Recreation 0.1195566 -40.0
Oregon Retail_Recreation 0.1191994 -40.5
Nevada Workplace -0.1154817 -40.0
Massachusetts Transit -0.1153366 -45.0
Ohio Parks -0.1140944 67.5
New Hampshire Grocery_Pharmacy -0.1132360 -6.0
Wisconsin Grocery_Pharmacy 0.1124157 -1.0
Wyoming Residential 0.1072399 12.5
Idaho Parks 0.1066545 -22.0
Nebraska Retail_Recreation 0.1058669 -36.0
Indiana Parks -0.1056155 29.0
Kansas Transit -0.1055316 -26.5
Texas Transit 0.1053948 -41.0
Arizona Parks -0.1044400 -44.5
Maryland Transit -0.1040511 -39.0
Texas Grocery_Pharmacy 0.1036361 -14.0
Massachusetts Residential 0.1027139 15.0
Arkansas Workplace -0.1017537 -26.0
Ohio Residential 0.0998167 14.0
Oklahoma Grocery_Pharmacy -0.0991393 -0.5
Kentucky Residential 0.0986728 12.0
Minnesota Workplace -0.0979903 -33.0
Oregon Residential 0.0941238 10.5
North Carolina Parks -0.0938300 7.0
North Dakota Residential -0.0935617 17.0
New York Residential 0.0926346 17.5
South Carolina Residential -0.0923748 12.0
Tennessee Parks -0.0909309 10.5
Oklahoma Residential 0.0891665 15.0
Georgia Parks 0.0882956 -6.0
Wyoming Retail_Recreation -0.0881894 -39.0
Arkansas Transit 0.0851358 -27.0
Indiana Workplace 0.0847900 -34.0
Missouri Transit -0.0839377 -24.5
Pennsylvania Residential 0.0835947 15.0
Virginia Workplace -0.0832074 -31.5
Maine Residential -0.0822658 11.0
Nevada Parks 0.0814815 -12.5
Arizona Workplace -0.0801403 -35.0
Washington Transit -0.0796160 -33.5
Michigan Residential 0.0779475 15.0
Oklahoma Workplace 0.0777974 -31.0
Virginia Retail_Recreation -0.0767574 -35.0
Ohio Grocery_Pharmacy 0.0758745 0.0
South Carolina Transit 0.0724508 -45.0
Michigan Transit 0.0721568 -46.0
Florida Workplace -0.0716393 -33.0
Indiana Grocery_Pharmacy -0.0706677 -5.5
Kentucky Retail_Recreation 0.0703894 -29.0
North Carolina Retail_Recreation 0.0703001 -34.0
Minnesota Grocery_Pharmacy 0.0683867 -6.0
Michigan Grocery_Pharmacy -0.0676157 -11.0
Colorado Transit 0.0674907 -36.0
Minnesota Residential -0.0674156 17.0
Mississippi Parks -0.0632962 -25.0
Florida Grocery_Pharmacy 0.0629509 -14.0
Utah Grocery_Pharmacy 0.0619806 -4.0
Mississippi Workplace -0.0595726 -33.0
Texas Retail_Recreation 0.0590620 -40.0
Iowa Workplace 0.0583961 -30.0
Montana Grocery_Pharmacy -0.0579555 -16.0
Iowa Grocery_Pharmacy -0.0573216 4.0
Ohio Retail_Recreation 0.0572350 -36.0
Montana Retail_Recreation -0.0570374 -51.0
Washington Parks 0.0568919 -3.5
Missouri Parks 0.0546480 0.0
West Virginia Transit -0.0546047 -45.0
Maine Grocery_Pharmacy -0.0509730 -13.0
West Virginia Residential -0.0492926 11.0
Kentucky Workplace -0.0490888 -36.5
South Dakota Retail_Recreation -0.0481312 -39.0
South Carolina Grocery_Pharmacy 0.0477584 1.0
Oregon Parks 0.0477543 16.5
Illinois Grocery_Pharmacy -0.0465500 2.0
Alabama Residential -0.0463279 11.0
New Hampshire Workplace 0.0409366 -37.0
Wisconsin Retail_Recreation 0.0367180 -44.0
South Carolina Retail_Recreation -0.0353652 -35.0
Indiana Transit 0.0344657 -29.0
Ohio Workplace -0.0336363 -35.0
Colorado Grocery_Pharmacy -0.0308997 -17.0
Missouri Retail_Recreation -0.0259241 -36.5
Washington Retail_Recreation -0.0256724 -42.0
Illinois Retail_Recreation 0.0250782 -40.0
Colorado Retail_Recreation -0.0247330 -44.0
Tennessee Grocery_Pharmacy 0.0241007 6.0
New Hampshire Transit -0.0229318 -57.0
Florida Transit -0.0180313 -49.0
New Mexico Workplace 0.0171926 -34.0
South Dakota Grocery_Pharmacy 0.0169836 -9.0
Kansas Residential -0.0162219 13.0
Georgia Transit -0.0152421 -35.0
Oklahoma Retail_Recreation 0.0150096 -31.0
West Virginia Retail_Recreation -0.0137251 -38.5
Tennessee Transit 0.0135762 -32.0
Vermont Transit -0.0129962 -63.0
Iowa Residential -0.0128833 13.0
Oklahoma Parks 0.0109946 -18.5
Oklahoma Transit 0.0105967 -26.0
Arkansas Grocery_Pharmacy -0.0102151 3.0
Kansas Retail_Recreation -0.0097880 -38.0
Missouri Grocery_Pharmacy 0.0088072 2.0
Colorado Workplace 0.0081260 -39.0
Tennessee Retail_Recreation -0.0080017 -30.0
Iowa Retail_Recreation -0.0045849 -38.0
Nevada Grocery_Pharmacy -0.0038583 -12.5
Nebraska Parks -0.0020738 55.5
Maryland Parks -0.0000770 27.0
Alaska Parks NA 29.0
District of Columbia Retail_Recreation NA -69.0
District of Columbia Grocery_Pharmacy NA -28.0
District of Columbia Parks NA -65.0
District of Columbia Transit NA -69.0
District of Columbia Workplace NA -48.0
District of Columbia Residential NA 17.0
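The perfect ±1 correlations at the top of this table (e.g. Alaska Transit and the Delaware rows) are most likely artifacts of states that contribute only two counties to `plot_data`, since any two points are perfectly correlated. Carrying the group size through the summary makes this easy to flag; a self-contained toy illustration of the same `ddply` pattern:

```r
library(plyr)

toy <- data.frame(Province.State = c("Delaware", "Delaware",
                                     "Maryland", "Maryland", "Maryland"),
                  variable = "Workplace",
                  Total_confirmed_cases.per100 = c(0.10, 0.30, 0.20, 0.25, 0.40),
                  value = c(-37, -30, -35, -33, -28))

# Same summarise as above, plus n so two-point groups are visible
summary_with_n <- ddply(toy, c("Province.State", "variable"), summarise,
                        n = length(value),
                        cor = cor(y = Total_confirmed_cases.per100, x = value),
                        median_change = median(value))
summary_with_n
```

Rows with `n == 2` should probably be dropped or down-weighted before ranking by `abs(cor)`.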
# sanity check
ggplot(filter(plot_data,Province.State %in% c("Pennsylvania","Maryland","New Jersey","California","Delaware","Connecticut")),aes(x=Total_confirmed_cases.per100,fill=variable))+geom_histogram()+
  facet_grid(~Province.State)+
    default_theme+
  theme(legend.position = "bottom")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

write_plot(mobility.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/mobility.plot.png"
write_plot(mobility.global.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/mobility.global.plot.png"
(plot_data.permobility_summary.plot<-ggplot(plot_data.permobility_summary,aes(x=variable,y=median_change))+
  geom_jitter(size=2,width=.2)+
  #geom_jitter(data=plot_data.permobility_summary %>% arrange(-abs(median_change)) %>% head(n=15),aes(col=Province.State),size=2,width=.2)+
  default_theme+
  ggtitle("Per-State Median Change in Mobility")+
  xlab("Mobility Measure")+
  ylab("Median Change from Baseline"))

write_plot(plot_data.permobility_summary.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/plot_data.permobility_summary.plot.png"

DELIVERABLE MANIFEST

The following links point to committed documents pushed to GitHub. These are provided as a convenience, but note this is a manual process: the generation of reports, plots, and tables is not coupled to the execution of this markdown.

Report

This report, html & pdf

Plots

github_root<-"https://github.com/sbs87/coronavirus/blob/master/"

plot_handle<-c("Corona_Cases.world.long.plot",
               "Corona_Cases.world.loglong.plot",
               "Corona_Cases.world.mortality.plot",
               "Corona_Cases.world.casecor.plot",
               "Corona_Cases.city.long.plot",
               "Corona_Cases.city.loglong.plot",
               "Corona_Cases.city.mortality.plot",
               "Corona_Cases.city.casecor.plot",
               "Corona_Cases.city.long.normalized.plot",
               "Corona_Cases.US_state.lm.plot",
               "Corona_Cases.US_state.summary.plot")


deliverable_manifest<-data.frame(
  name=c("World total & death cases, longitudinal",
         "World log total & death cases, longitudinal",
         "World mortality",
         "World total & death cases, correlation",
         "City total & death cases, longitudinal",
         "City log total & death cases, longitudinal",
         "City mortality",
         "City total & death cases, correlation",
         "City population normalized total & death cases, longitudinal",
         "State total cases (select) with linear model, longitudinal",
         "State total cases, longitudinal"),
  plot_handle=plot_handle,
  link=paste0(github_root,"results/",plot_handle,".png")
)


(tmp<-data.frame(row_out=apply(deliverable_manifest,MARGIN = 1,FUN = function(x) paste(x[1],x[2],x[3],sep=" | "))))
##                                                                                                                                                                                                        row_out
## 1                                           World total & death cases, longitudinal | Corona_Cases.world.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.long.plot.png
## 2                                 World log total & death cases, longitudinal | Corona_Cases.world.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.loglong.plot.png
## 3                                                         World mortality | Corona_Cases.world.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.mortality.plot.png
## 4                                      World total & death cases, correlation | Corona_Cases.world.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.casecor.plot.png
## 5                                              City total & death cases, longitudinal | Corona_Cases.city.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.plot.png
## 6                                    City log total & death cases, longitudinal | Corona_Cases.city.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.loglong.plot.png
## 7                                                            City mortality | Corona_Cases.city.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.mortality.plot.png
## 8                                         City total & death cases, correlation | Corona_Cases.city.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.casecor.plot.png
## 9  City population normalized total & death cases, longitudinal | Corona_Cases.city.long.normalized.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.normalized.plot.png
## 10                     State total cases (select) with linear model, longitudinal | Corona_Cases.US_state.lm.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.lm.plot.png
## 11                                      State total cases, longitudinal | Corona_Cases.US_state.summary.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.summary.plot.png
row_out<-apply(tmp, 2, paste, collapse="\t\n")
| name | handle | link |
|------|--------|------|
| World total & death cases, longitudinal | Corona_Cases.world.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.long.plot.png |
| World log total & death cases, longitudinal | Corona_Cases.world.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.loglong.plot.png |
| World mortality | Corona_Cases.world.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.mortality.plot.png |
| World total & death cases, correlation | Corona_Cases.world.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.casecor.plot.png |
| City total & death cases, longitudinal | Corona_Cases.city.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.plot.png |
| City log total & death cases, longitudinal | Corona_Cases.city.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.loglong.plot.png |
| City mortality | Corona_Cases.city.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.mortality.plot.png |
| City total & death cases, correlation | Corona_Cases.city.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.casecor.plot.png |
| City population normalized total & death cases, longitudinal | Corona_Cases.city.long.normalized.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.normalized.plot.png |
| State total cases (select) with linear model, longitudinal | Corona_Cases.US_state.lm.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.lm.plot.png |
| State total cases, longitudinal | Corona_Cases.US_state.summary.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.summary.plot.png |

Tables

CONCLUSION

Overall, the trend of COVID-19 cases is no longer in the log-linear phase for the world or the U.S. (though some regions, such as MD, are still log-linear). The mortality rate (deaths / confirmed RNA-based cases) is >1%, with a range depending on region. Mobility is not a strong indicator of caseload (U.S. data).
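The mortality rate quoted above is simply cumulative deaths divided by cumulative confirmed cases. A minimal sketch with made-up numbers (not the actual figures from this dataset):

```r
# Toy illustration of the mortality-rate calculation: deaths / confirmed cases.
# These counts are invented for demonstration only.
cases <- data.frame(
  region          = c("World", "US", "MD"),
  total_confirmed = c(6500000, 1900000, 55000),
  total_deaths    = c(385000, 109000, 2600)
)
cases$mortality_rate <- cases$total_deaths / cases$total_confirmed
cases
```

Each toy rate comes out above the 1% threshold noted in the conclusion; in the real analysis the same ratio is computed per region and per date.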

See table below for detailed breakdown.

| Question | Answer |
|----------|--------|
| What is the effect of social distancing and decreased mobility on case load? | There is no strong apparent effect of decreased mobility (work, grocery, retail) or increased mobility (at residence, parks) on the number of confirmed cases, at either the country (U.S.) or state level. California appears to have one of the best correlations, but the results are mixed. |
| What is the trend in cases and mortality across geographical regions? | Total confirmed cases and mortality are broadly log-linear for most countries, with a tail-off beginning for most (including the U.S.). At the state level, NY, NJ, and PA are starting to tail off; MD is still in the log-linear phase. Mortality and case load are highly correlated for NY, NJ, PA, and MD. The mortality rate fluctuates for a given region, but is about 3% overall. |
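One way to check whether a region is still in the log-linear phase is to fit a linear model to log10(cases) over time and inspect the fit. A sketch on synthetic data (the variable names here are illustrative, not the ones used in the analysis above):

```r
# Synthetic exponential-growth series: cases double roughly every 3 days,
# so log10(cases) is exactly linear in time.
days  <- 0:20
cases <- 100 * 2^(days / 3)

fit <- lm(log10(cases) ~ days)

summary(fit)$r.squared  # ~1 while the series is still log-linear
coef(fit)["days"]       # slope: log10 growth per day (log10(2)/3 here)
```

For a region exiting the log-linear phase, recent points fall below this fitted line and R-squared drops, which is the pattern described for the world and U.S. totals.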

END

End: ##—— Thu Jun 4 17:53:25 2020 ——##

Cheatsheet: http://rmarkdown.rstudio.com

Sandbox

# Geographical heatmap!
# install.packages("maps")  # uncomment to install on first run
library(maps)
library(dplyr)
library(ggplot2)
mi_counties <- map_data("county", "pennsylvania") %>%  # Pennsylvania counties, despite the mi_ name
  select(lon = long, lat, group, id = subregion)
head(mi_counties)

ggplot(mi_counties, aes(lon, lat)) + 
  geom_point(size = .25, show.legend = FALSE) +
  coord_quickmap()
mi_counties$cases <- seq_len(nrow(mi_counties))  # dummy case counts for testing
name_overlaps(metadata,Corona_Cases.US_state)

tmp<-merge(Corona_Cases.US_state,metadata)
ggplot(filter(tmp,Province.State=="Pennsylvania"), aes(Long, Lat, group = as.factor(City))) +
  geom_polygon(aes(fill = Total_confirmed_cases), colour = "grey50") + 
  coord_quickmap()


ggplot(Corona_Cases.US_state, aes(Long, Lat))+
  geom_polygon(aes(fill = Total_confirmed_cases ), color = "white")+
  scale_fill_viridis_c(option = "C")
dev.off()


require(maps)
require(viridis)

world_map <- map_data("world")
ggplot(world_map, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill="lightgray", colour = "white")

head(world_map)
head(Corona_Cases.US_state)
unique(select(world_map,c("region","group"))) %>% filter()

some.eu.countries <- c(
  "US"
)
# Retrieve the map data
some.eu.maps <- map_data("world", region = some.eu.countries)

# Compute the centroid as the mean longitude and latitude
# Used as the label coordinate for country names
region.lab.data <- some.eu.maps %>%
  group_by(region) %>%
  summarise(long = mean(long), lat = mean(lat))

unique(filter(some.eu.maps,subregion %in% Corona_Cases.US_state$Province.State) %>% select(subregion))
unique(Corona_Cases.US_state$Total_confirmed_cases.log)
ggplot(filter(Corona_Cases.US_state, Date=="2020-04-17"), aes(x = Long, y = Lat)) +
  geom_polygon(aes( fill = Total_confirmed_cases.log))+
  #geom_text(aes(label = region), data = region.lab.data,  size = 3, hjust = 0.5)+
  #scale_fill_viridis_d()+
  #theme_void()+
  theme(legend.position = "none")
library("sf")
library("rnaturalearth")
library("rnaturalearthdata")

world <- ne_countries(scale = "medium", returnclass = "sf")
class(world)
ggplot(data = world) +
    geom_sf()

counties <- st_as_sf(map("county", plot = FALSE, fill = TRUE))
counties <- subset(counties, grepl("florida", counties$ID))
counties$area <- as.numeric(st_area(counties))
#install.packages("lwgeom")
class(counties)
head(counties)
ggplot(data = world) +
    geom_sf(data=Corona_Cases.US_state) +
  geom_sf(data = counties, aes(fill = area)) +
   # scale_fill_viridis_c(trans = "sqrt", alpha = .4) +
    coord_sf(xlim = c(-88, -78), ylim = c(24.5, 33), expand = FALSE)


head(counties)
tmp<-unique(select(filter(Corona_Cases.US_state,Date=="2020-04-17"),c(Lat,Long,Total_confirmed_cases.per100)))
st_as_sf(map("county", plot = FALSE, fill = TRUE))

dplyr::inner_join(Corona_Cases.US_state, counties)

library(sf)
library(sp)

nc <- st_read(system.file("shape/nc.shp", package="sf"))
class(nc)


spdf <- SpatialPointsDataFrame(coords = select(Corona_Cases.US_state,c("Lat","Long")), data = Corona_Cases.US_state,
                               proj4string = CRS("+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0"))

head(spdf)
class(spdf)
st_as_sf(spdf)  # convert the sp SpatialPointsDataFrame to an sf object

filter(Corona_Cases.US_state.summary,Date=="2020-04-20" & Province.State %in% top_states_modified)

https://stevenbsmith.net